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ABSTRACT The site-to-site variability in species composition, known as /8-diversity, is crucial to understanding spatiotemporal 
patterns of species diversity and the mechanisms controlling community composition and structure. However, quantifying 
/3-diversity in microbial ecology using sequencing-based technologies is a great challenge because of a high number of sequenc- 
ing errors, bias, and poor reproducibility and quantification. Herein, based on general sampling theory, a mathematical frame- 
work is first developed for simulating the effects of random sampling processes on quantifying /3-diversity when the community 
size is known or unknown. Also, using an analogous ball example under Poisson sampling with limited sampling efforts, the 
developed mathematical framework can exactly predict the low reproducibility among technically replicate samples from the 
same community of a certain species abundance distribution, which provides explicit evidences of random sampling processes as 
the main factor causing high percentages of technical variations. In addition, the predicted values under Poisson random sam- 
pling were highly consistent with the observed low percentages of operational taxonomic unit (OTU) overlap ( < 30% and < 20% 
for two and three tags, respectively, based on both Jaccard and Bray-Curtis dissimilarity indexes), further supporting the hy- 
pothesis that the poor reproducibility among technical replicates is due to the artifacts associated with random sampling pro- 
cesses. Finally, a mathematical framework was developed for predicting sampling efforts to achieve a desired overlap among rep- 
licate samples. Our modeling simulations predict that several orders of magnitude more sequencing efforts are needed to achieve 
desired high technical reproducibility. These results suggest that great caution needs to be taken in quantifying and interpreting 
/3-diversity for microbial community analysis using next-generation sequencing technologies. 

IMPORTANCE Due to the vast diversity and uncultivated status of the majority of microorganisms, microbial detection, character- 
ization, and quantitation are of great challenge. Although large-scale metagenome sequencing technology such as PCR-based 
amplicon sequencing has revolutionized the studies of microbial communities, it suffers from several inherent drawbacks, such 
as a high number of sequencing errors, biases, poor quantitation, and very high percentages of technical variations, which could 
greatly overestimate microbial biodiversity. Based on general sampling theory, this study provided the first explicit evidence to 
demonstrate the importance of random sampling processes in estimating microbial /3-diversity, which has not been adequately 
recognized and addressed in microbial ecology. Since most ecological studies are involved in random sampling, the conclusions 
learned from this study should also be applicable to other ecological studies in general. In summary, the results presented in this 
study should have important implications for examining microbial biodiversity to address both basic theoretical and applied 
management questions. 
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Microorganisms appear to be the most diverse group of life 
presently known, inhabiting almost every imaginable envi- 
ronment on Earth (1). They play integral and unique roles in 
ecosystem functions and biogeochemical cycling of carbon (C), 
nitrogen (N), sulfur (S), phosphorus (P), and various metals. Un- 
derstanding the structure, functions, interactions, stability, and 
adaptations of microbial populations/communities is crucial for 
basic science discovery (2, 3), biotechnology (4), agriculture (5), 
energy (6), the environment (7), and human health (8). However, 
due to their extremely high diversity and as-yet-uncultivated sta- 



tus, characterizing microbial diversity and establishing the link- 
ages between microbial diversity and ecosystem function are very 
challenging (9). Understanding the mechanisms controlling mi- 
crobial diversity and functions is even more difficult. The recent 
advances in metagenomics, which has emerged as a cutting-edge 
21st century science (10), and associated metagenomics technol- 
ogies such as high-throughput sequencing and microarrays (10, 
1 1 ) provide revolutionary tools to address these challenges. Large- 
scale high-throughput sequencing-based metagenomics is pro- 
viding unprecedented views of the taxonomic diversity, metabolic 
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potential, and ecological roles of microbial communities in vari- 
ous habitats (12-32). Various studies clearly demonstrate that 
large-scale sequencing approaches are powerful in studying mi- 
crobial community diversity and activity (23, 33-43). 

/3-Diversity, the site-to-site variability in species composition, 
is crucial in understanding patterns of species diversity across var- 
ious spatial and temporal scales. It can provide insight on the 
mechanisms controlling community compositions and structure 
(44). J3-Diversity is widely used for investigating the mechanisms 
controlling biodiversity (45-47) and the responses of biological 
communities to environmental changes (45). In microbial ecol- 
ogy, high-throughput metagenomics sequencing and associated 
technologies are the major tools for examining the site-to-site 
variability in species composition and its response to environ- 
mental changes (48, 49). However, the amplicon-based approach 
has shown limited reproducibility, especially when examining 
low-abundance taxa (50). For instance, based on the overlap of 
operational taxonomic units (OTUs), a very low reproducibility 
(<20% between two technical replicates and <10% among three 
technical replicates) was obtained (50), which is far away from the 
theoretical expectation of 100% overlap among technical repli- 
cates. Similar results were recently obtained with mock commu- 
nities (51) and human microbiomes (13). 

The high numbers of variations in technical replicates are most 
likely due to the sampling artifacts associated with random sam- 
pling processes (50, 52), because many steps in the pyrotag-based 
sequencing analysis are associated with random sampling, e.g., 
PCR amplification of target genes, ligation of amplified PCR prod- 
ucts to sequencing adaptors, emulsion and immobilization of 
beads, and bead deposition (50). Since microbial communities 
under natural settings are extremely complex and generally have 
abundance curves with very long tails, that is, large portions of 
OTUs exist in extremely low abundance, the probability of sam- 
pling such rare OTUs in a sampling event is low. The chances of 
resampling them are even lower, especially with limited sampling 
efforts (i.e., percentages of total individuals sampled in a commu- 
nity). It is expected that the severity of such sampling artifacts on 
community comparison is dependent on community complexity 
and sampling efforts. As the complexity of a microbial community 
increases, such an artifact will become more severe. Increasing 
sampling efforts will help to ameliorate such a problem (50). 
However, there is no theoretical foundation to support such a 
speculation. 

In this study, we hypothesize that a random sampling process is 
the main cause for high numbers of variations among technical 
replicates. To test this hypothesis, the main objective of this study 
is to provide a theoretical foundation on understanding such sam- 
pling artifacts associated with a random sampling process. We first 
developed a theoretical framework to simulate the random sam- 
pling processes based on general sampling theory and to predict 
sampling efforts for achieving a desired reproducibility. We then 
illustrated the sampling artifacts associated with random sam- 
pling processes using an analogous example. We also examined 
whether the developed framework could be used to predict the 
low percentage of overlap of OTUs among technical replicates. 
Our results indicated that high numbers of variations in technical 
replicates were due to the artifacts associated with random sam- 
pling processes. 

Mathematical framework, (i) Sampling individuals from a 
large regional community. The pattern of species abundances 



across different spatial-temporal scales is a central issue in ecol- 
ogy. However, it is generally impossible to directly measure the 
abundance of all species at ecologically relevant scales. Thus, it is 
important to understand the relationship between the underlying 
species abundance distribution of a large regional community and 
the observed abundance distribution in a small sample from the 
large regional community. Various statistical sampling theories 
are developed to describe the relationships of the species abun- 
dance between the sampled community and the large regional 
community (53). Here, we will use general sampling theory to 
simulate and predict the sampling artifacts associated with ran- 
dom sampling processes. 

Assuming that a species (or OTU) occurs in a large regional 
community, the number of its individuals that exist in a small 
sample of the large community is dependent on the total abun- 
dance of that species in the large community, the size of the sam- 
ple, and the spatial distribution of the individuals (54). In this 
study, we assume that all individuals in the large regional commu- 
nity are located randomly in space. Let N represent the total num- 
ber of individuals (e.g., 16S rRNA gene sequences) and n be the 
number of different species (e.g., OTUs), each with abundances 
Xj, x 2 , . . ., x n . If an individual is randomly sampled from the com- 
munity, the probability of it belonging to the i th species is x/N. If 
m individuals are randomly sampled from this large community, 
the expected number of individuals from the i th species is mx/N. 
With replacement, then, the probability of the i th species with 
abundance x, to be encountered by k individuals in the sample is 
given by the following binomial function expression (55, 56): 

p(k\ X „N,m) = (^){^)\l-^y k (1) 

To estimate the probability that at least one individual of the i th 
species is present in the sample, we generally calculate the proba- 
bility that the i th species is absent in the sample, i.e., k = 0. If k = 0, 
then the probability of the z' th species with abundance x i in a com- 
munity to be absent in the sample is expressed as 

p(0\x i ,N,m) = (l-~) (2) 

Let xJN equal axjm, where a = m/N, the sampling ratio. Then, 
equation 2 is approximated to the exponential form of the Poisson 
distribution: 

\imp(0\x i ,N,m) = e- m (3) 

mr-xx> 

According to the Poisson distribution, the probability of the z th 
species with abundance x i in the community to have at least one 
individual in the sample can be expressed as follows (54): 

ip(a, x^ = 1 - p(0\x h N, m) « 1 - e~ m (4) 
The Poisson distribution is the simplest model for sampling indi- 
viduals from a large regional community. Based on general sam- 
pling theory (53, 56), the abundance distribution observed in a 
sample that constitutes a proportion a of the large regional com- 
munity can be expressed as 

<W m ) = j *\>(a,x)§(x,Q)dx (5) 

0 

where tp a (m) is the observed species abundance distribution in a 
sample with m individuals sampled. ip(x) represents the species 
abundance distribution with abundance x in the large regional 
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community, in which 6 is the vector of parameters (57). 

(ii) Expected species overlap among samples with the size of 
the large community known. Given a large community with a 
known underlying species abundance distribution, cp(x), the ex- 
pected proportion of species in common between samples can be 
theoretically predicted. In the case in which the species abun- 
dances in these samples are perfectly correlated, the expected 
number of the species shared between two samples is given by (54) 



\\>(a l ,x)\\s{a 2 ,x)(^(x, Q)dx (6) 



and the number of species shared among three samples is 



\\i{a u x)\\i(a 2 , x)\\i(a 3 , x)4>(x, Q)dx (7) 



where a v a 2 , and a 3 are sample ratios from samples 1, 2 and 3, and 
a x = mJN,l = 1, 2, 3. m 1 is the number of individuals (i.e., 16S 
rRNA gene sequences in this study) obtained from the I th sample. 

Various similarity metrics are used for assessing overlap 
among different samples (58). In this study, the two popular sim- 
ilarity metrics, Jaccard's incidence-based and Bray-Curtis's 
abundance-based methods, are used. Based on the Jaccard simi- 
larity index, the proportion of species (i.e., OTUs in this study) 
overlap between two samples [C^(fl 1 ,a 2 ,0)] is calculated as follows: 



^(flj, x)i\i(a 2 , x)$(x, Q)dx 

OK«i> «» 0) = — ^ (8) 

\\i(a 1 ,x)(\>(x, Q)dx 



+ j \\>(a 2 ,x)$(x,Q)dx 
o 



i|/(fli, x)\\i(a 2 , x)$(x, Q)dx 



The proportion of species overlap among three samples 
[0 J 3 (a 1 ,fl 2 ,a 3 ,e)] is 



\\i(a lt x)\\i (a 2 , x)\\i(a 3 , x)$(x, Q)dx 



D, 



(9) 



where 



D } = J ty(a 1 ,x)$(x,Q)dx+ \ i\>{a 2 ,x)$(x,Q)dx 

0 



+ J \\>(a } , x)$(x, Q)dx — J i|)(a 1 ,x)v|)(a 2 ,x)4)(x, Q)dx 

0 0 



i[i(a 2 , x)4i(a 3 , x)4>(x, Q)dx — J x)ty(a 3 , *)<K*> Q)dx 

o o 

+ f \\>(a 1 ,x)\\>(a 2 ,x)\\i(a 3 ,x)$(x,Q)dx 

o 



There are seven commonly used continuous species abun- 
dance distributions, including lognormal, exponential, gamma, 
truncated hyperbolic, continuous log series (54), inverse gamma, 
and inverse Gaussian distribution. When the total number of in- 
dividuals of the regional community (N) is known, the Jaccard 
similarity-based explicit expression functions of the proportion of 
species overlap between two or three samples are summarized in 
Tables SI A and SIB in the supplemental material. 

(iii) Expected species overlap among samples with the size of 
the large community unknown. Using equations 8 and 9 to esti- 
mate the percentages of species overlap between two or three sam- 
ples requires information about the total individuals (AT) of the 
communities examined. However, in most cases, the number of 
total individuals in a community is unknown. In the following, we 
will consider the situation in which N is unknown. 

Since most of the species abundance distributions are scale 
invariant under a Poisson sampling process (53, 57, 59, 60), the 
expected sample abundance distribution can be obtained by res- 
caling the community abundance distribution (x) to the sample 
abundance distribution (y):y = px, where p is the proportion of 
the community sampled. Thus, equation 8 can be rewritten as 



O, 2 *(<?;, 02,8*) 



(10) 



A?{a\,y)ty{a 2 ,y)${y,Q*)dy 



T\s{a\,y)§(y,Q*)dy 



\\i(a 2 ,y)$(y,Q*)dy 



\\i{a],y)\\i(a" 2 ,y)$(y,Q*)dy 



whereai, a 2 , and 6* are the sample ratios and the vector of param- 
eters when N is unknown, which are the rescaled parameters and 
functions of p. In the case of two random samples from the same 
community with parameters 0, we setp = a 1 + a 2 and then obtain 
a\ = + a 2 ) and d 2 = fl 2 /(fl 1 + a 2 ). Let m 1 and m 2 be the total 

number of individuals observed in sample 1 and sample 2, respec- 
tively. Now we can substitute a t = mJN and a 2 = m 2 /N into and 
d 2 , and then both a\ and d 2 are the functions of m 1 and m 2 , i.e., d x = 
m l /(m l + m 2 ) and d 2 = m 2 l{m l + m 2 ). 6* is the vector of param- 
eters of the species abundance distribution in the samples. Con- 
sequently, the expected species overlap can be obtained by fitting 
data to the parameters 6* without knowing the total number of 
individuals of the community, N. 

Similarly, in the case of three samples randomly sampled from 
one community, the parameter p can be set to the sum of sampling 
ratios, p = a x + a 2 + a 3 . Then, equation 9 can be rewritten as 



i|;(aj,;K)iKa 2 ,y)iK(4y)<Ky, ®*)dy 



<J 1 ,*(d 1 ,d 2 ,d 3 ,Q* ) 



(11) 



where 
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3 = Ma\,y)4>(y,Q*)dy 



+ ^{a 2 ,y)$(y,Q*)dy 



Mai,yMal,y)^(y,Q*)dy (12) 



(a\,y)\\)(al,y)$(y,Q*)dy 



\\i(d 2 ,y)\\i(d 3 ,y)$(y, Q*)dy 



ty{al>y)M<h,y)W a l>y)<b(y' ®*)dy 



and a\ = a 1 /(a 1 + a 2 + a 3 ) = m 1 /(m 1 + m 2 + m 3 ), d 2 = a 2 /(a 1 + 
« 2 + « 3 ) = m 2 /(m 1 + m 2 + m 3 ), d 3 = a 3 l(a 1 + a 2+ + a 3 ) = m 3 l(m l 
+ m 2 + m 3 ). 

Now, according to equations 10 and 11, the species overlap for 
two and three samples can be estimated based on sample abun- 
dance y rather than community abundance x, so that there is no 
need to know the community size N. When the total number of 
individuals of the regional community is unknown, the Jaccard 
similarity-based explicit expression functions of the proportion of 
species overlap between two or among three samples are summa- 
rized in Tables SIC and SID in the supplemental material. 

(iv) Predicting sampling efforts for achieving a desired over- 
lap among replicate samples. One important question in practice 
is what levels of sampling efforts are needed for achieving desired 
species overlap when the number of total individuals of the com- 
munity is unknown. In the following, we will address this practical 
question by assuming that and m l and m ; individuals are needed 
for sampling to achieve a desired overlap between two samples. To 
simplify the situation, let m x = m 2 = m, and sampling ratio A = 
m IN. Based on equation 13, when Nis known, the predicted over- 
lap between two samples is given by 



i|jCA, x)»|i(A, x)${x, 



or 



(13) 



\\>(A,x)$(x,Q)dx+ \\>{A,x)$(x, Q)dx 



\\i(A, x)\\i(A, x)${x, Q)dx 



Similarly, when N is unknown, we can take the transformation 
of the abundance variable y = px = (a l + a 2 )x. By comparing 
equations 8 and 1 0, the relationship of the expected number of the 
shared species when N is known or unknown is given by 



*\i{A,x)$(x,Q)dx= 4i(A* ,y)$(y,Q*)dy (14) 



where A* 



A 



m'A.niY + m 2 ) is the predicted sample ratio 



for achieving the desired overlap between two samples, and 0* is 
the vector of scaled parameters. Rearranging A* as the expression 
of the number of sequences, A* = m'/(m 1 + ra 2 ), the predicted 
overlap model is given by 



Gj 



,2' * 



{x\i[m ' /(m, + m 2 ), y]} 2 $(y, 6 * )dy 



2 {i|j[m , /(m 1 + m 2 ), y]}4>(y, 



(15) 



{i|j[m' /(m 1 + m 2 ), x]} 2 o>(y, Q*)dy 



Similarly, the predicted overlap model for three samples is 
given by 



{\\i[m ' /(m l + m 2 + m 3 ),y] } 3 <Ky, Q*)dy 



or 



3 4*[ m ' /( m i + m 2 + fn 3 ),y]$(y, Q*)dy 



3 {*\)[m ' /{m l + m 2 + m 3 ),y\} 2 §(y, Q*)dy 



+ | {v|)[m ' /(m l + m 2 + m 3 ),y]} 3 4>(y, Q*)dy 

(16) 

By solving equations 15 and 16, the sampling efforts m' can be 
estimated based on sample abundance y without knowing the 
community size N. The Jaccard similarity-based explicit expres- 
sion functions of the proportion of species overlap between two or 
among three samples for predicting sampling efforts are summa- 
rized in Tables SIC and SID in the supplemental material. 

RESULTS 

Simulation with an analogous example. To better illustrate the 
effects of random sampling processes on the OTU overlap among 
technical replicates ( 3 ) , we use an analogous example by randomly 
sampling balls from three j ars containing the same number of balls 
of different colors (Fig. 1 ) . Assume that three identical jars contain 
N balls of n different colors. Their abundance distributions vary 
among different colors of balls but are identical among these jars. 
Here, individual balls are equivalent to individual 16S rRNA gene 
sequences, while the types of balls with different colors are equiv- 
alent to individual OTUs. To simplify the situation, we assume 
that the same numbers of balls, m, are randomly sampled from 
these three jars, yielding samples 1, 2, and 3 (Fig. 1). Theoretically, 
if all balls from the whole jars are sampled (i.e., m = N), 100% ball 
overlap will be expected among these samples. However, in reality, 
the percentages of ball overlap will be less than 100%, because they 
depend on the sampling efforts, ball abundance distribution, and 
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Three identical jars (communities) 



3kl 



n colors of balls (species) 
with identical distribution 



Randomly selected m balls 



(sampling) 




3*3 



W balls (individuals) 



Randomly selected m balls ^ 
(sampling) 




What percentage of 
overlapped colors 
(species) among 
samples? 



Randomly selected m balls 



(sampling) 




FIG 1 An analogous example to simulate random sampling processes. Three identical jars contain the 
same number and types of balls, with identical ball abundance distribution. 



complexity of the community. The differences of the percentages 
of ball overlap between the theoretically predicted and the ob- 
served values among different sampling events are entirely due to 
random sampling processes, because there is no difference in the 
ball compositions and abundances among these three jars. 

With the assumption that the ball abundance distributions in 
these three jars follow any of the five continuous species abun- 
dance distributions as listed in Table SI A in the supplemental 
material, we simulated the effects of the random sampling pro- 
cesses on the ball overlap based on Jaccard and Bray-Curtis in- 
dexes for five different distributions: exponential, gamma, log- 
normal, inverse Gaussian, and inverse gamma. Under each 
specific ball abundance distribution, the average observed overlap 
through simulations of 100 repeated samplings (see Materials and 
Methods for details) was calculated and compared to the theoret- 
ically predicted overlap between two samples based on equation 8 
or 10, when Wis known or unknown, respectively. Similar analy- 



ses were carried out for comparing ball 
overlaps among three samples based on 
equation 9 or 1 1 . Although the simulation 
results vary considerably with the param- 
eters selected, the following generaliza- 
tions can be drawn. First, no significant 
differences of Jaccard similarities were 
observed between the theoretically pre- 
dicted and the observed overlap values for 
all five species abundance distributions 
(Table 1 and Fig. 2; see also Fig. SI and S2 
in the supplemental material), indicating 
that the theoretically predicted percent- 
ages of ball overlap match well to the ob- 
served percentages of ball overlap for all 
ball abundance distributions examined. 
Second, there were no significant differ- 
ences of the predicted overlap percentages 
derived from equation 8 with known N 
and equation 10 with unknown N for two 
samples or from equation 9 with known 
N and equation 1 1 with unknown N for 
three samples (see Table S2 in the supple- 
mental material), suggesting that accurate 
predictions of the overlap percentages be- 
tween two or among three samples can be 
obtained when the size of the regional 
communities is unknown. In addition, 
low ball overlap percentages were ob- 
served with low sampling efforts under 
different ball abundance distributions 
(Fig. 2; see also Fig. SI and S2). For instance, under exponential 
distribution, when 1% of the community was sampled, 50% over- 
lap between two samples (Fig. 2A) and 34% overlap among three 
samples (Fig. 2B) were obtained. As the sampling efforts increase, 
the ball overlap among samples increases under different ball 
abundance distributions (Fig. 2; see also Fig. SI and S2). For ex- 
ample, under exponential distribution, when 10% of the commu- 
nity was sampled, 91% ball overlap between two samples (Fig. 2A) 
and 82% among three samples were obtained (Fig. 2B). All of the 
above results suggested that accurate predictions of the overlap 
could be obtained between two samples with equation 8 or 10 and 
among three samples with equation 9 or 11. 

Because the general explicit form of the ball overlap for quan- 
titative similarity index could not be derived, numerical simula- 
tions were performed (see Materials and Methods) for all five ball 
abundance distributions (see Fig. S3 in the supplemental material) 



TABLE 1 Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the analogous example 0 



OTU abundance 
distribution 


Two samples 








Three samples 








Known N 




Unknown N 




Known N 




Unknown N 




X 2 


P 


X 2 


P 


X 2 


P 


X 2 


P 


Exponential 


6.6 X 1(T 5 


0.999 


6.6 X 10~ 5 


0.999 


2.7 X 10" 4 


0.999 


2.7 X 10- 4 


0.999 


Gamma 


1.9 X 1(T 6 


0.999 


5.1 X 10- 4 


0.999 


9.4 X 10- 6 


0.999 


6.5 X 10~ 3 


0.999 


Lognormal 


4.2 X 1(T 4 


0.999 


2.1 X 10~ 4 


0.999 


1.1 X 10~ 3 


0.999 


1.1 X 10-3 


0.999 


Inverse gamma 


4.0 X 10-" 


0.999 


2.1 X 10- 4 


0.999 


6.5 X 10- 6 


0.999 


2.2 X 10- 3 


0.999 


Inverse Gaussian 


3.3 X 1(T 5 


0.999 


9.1 X 10~ 4 


0.999 


7.5 X 10- 5 


0.999 


8.5 X 10- 4 


0.999 



" Detailed information is presented in Fig. S2 and S3 in the supplemental material. 
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Sampling effort (a, =a 2 ) Sampling effort (a A =a 2 =a 3 ) 

FIG 2 The relationships between the expected Jaccard overlaps of ball colors and sampling efforts under the exponential abundance distribution, assuming the 
community has 10 6 individual balls and 10 4 types of balls, with different colors. Distribution parameter is set to A = 1 X 10~ 2 . In each case, we calculated the 
theoretically predicted overlap (blue line) by equation 8 when N is known, the predicted overlap (red line) by equation 10 when N is unknown, and the average 
observed overlap (point) through simulations of 100 repeated samplings. (A) Two samples. The sample ratio is a 1 = u 2 ' (B) Three samples. The sample ratio is 
«! = fl 2 = a 3 . 



for comparisons based on Jaccard similarity, all of the numerical 
simulations indicated that low percentages of overlap were also 
observed among samples with low sampling efforts. For instance, 
under exponential distribution, when 1% of the community was 
sampled, 53% of ball overlap based on the Bray-Curtis index was 
obtained between two samples and 35% among three samples. All 
of these results, based on both Jaccard and Bray-Curtis similari- 
ties, indicate that substantially high numbers of variations can be 
obtained among communities of identical compositions and 
abundance distributions when sampling effort is low. Such varia- 
tion is entirely due to random sampling processes. 

Empirical examples. In our previous studies, the composition 
and structure of 24 microbial communities from a long-term 
global change experimental site in Oklahoma were analyzed using 
the amplicon-based sequencing detection approach (49). Each 
community was amplified with 2 or 3 bar-coded primers, followed 
by sequencing using both forward and reverse primers. Thus, 
three types of datasets were available: sequences from forward and 
reverse primers and combined sequences. Since the sequences of 
individual technical replicates within an experimental plot are de- 
rived from the same community, conceptually, they should obey 
the same species abundance distribution. Thus, the sequences 
from all technical replicates within a plot (a community) were 
pooled for fitting species abundance models described in Ta- 



ble SI A in the supplemental material. The best-fit model and as- 
sociated parameters for each plot (i.e., community) were shown in 
Table S3A for the soil communities amplified with two tags and in 
Table S3B for the soil communities amplified with three tags. Ei- 
ther the exponential abundance distribution or the inverse 
gamma distribution is the preferred OTU abundance distribution 
for all soil samples (see Tables S3 A and S3B). 

Once the OTU abundance distribution models are determined 
for all communities examined, the predicted overlap percentages 
in terms of Jaccard and Bray-Curtis similarities were calculated 
based on the formula provided in Tables SIC and SID in the 
supplemental material. The results were listed in Tables S4A and 
S4B for the communities amplified with two tags and in Ta- 
bles S5A and S5B for those with three tags. Overall, no significant 
differences were observed between observed and predicted Jac- 
card overlaps (Table 2). 

Since Bray-Curtis overlap is a quantitative similarity index, it is 
generally higher than the incidence-based overlap (see Tables S5A 
and S5B in the supplemental material). Overall, no significant 
differences were observed between observed and predicted Bray- 
Curtis overlap percentages (Table 2). All of the above-reported 
results indicated that the predicted overlap percentages match 
well to the observed results. Therefore, the variations among tech- 



TABLE 2 Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the experimental data" 



Communities 
amplified with: 


Similarity 
test 


Forward primer 




Reverse primer 




Combined 




X 2 


P 


X 2 


P 


X 2 


P 


Two tags 


Jaccard 


0.040 


0.999 


0.060 


0.999 


0.038 


0.999 




Bray-Curtis 


0.106 


0.999 


0.122 


0.999 


0.075 


0.999 


Three tags 


Jaccard 


0.006 


0.999 


0.010 


0.999 


0.005 


0.999 




Bray-Curtis 


0.008 


0.999 


0.026 


0.999 


0.016 


0.999 



a Detailed data are listed in Table S4A {two tags, Jaccard), S4B (two tags, Bray-Curtis), S5A (three tags, Jaccard}, and S5B (three tags, Bray-Curtis) in the supplemental material. 
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FIG 3 Prediction of sampling efforts for desired OTU overlap. (A) Desired overlap between two tags based on the combined sequences from sample 2UC. The 
sampling efforts were calculated based on equation 15. The parameters for species abundance distribution were from Table S3A in the supplemental material. (B) 
Desired overlap among three tags based on the combined sequences from sample 1UC. The sampling efforts were calculated based on equation 16. The 
parameters for species abundance distribution were from Table S3B. 



nical replicates can be best explained by the artifacts associated 
with random sampling processes. 

Predictions of sampling efforts for the desired species over- 
lap. Based on equations 15 and 16, on the explicit expression func- 
tions presented in Tables SIC and SID, and on sample abundance 
and sequence information for various tags in Tables S3A and S3B, 
sampling efforts for achieving various degrees of OTU overlaps 
between two (see Table S6A) and among three technical replicates 
(see Table S6B) were predicted. To achieve 20% OTU overlap for 
two tags, an average of 2,367 sequences are needed (see Table S6A; 
Fig. 2A, 2UC), which is consistent with the number of sequences 
obtained in this experiment. To achieve 90% overlap between two 
technical replicates, an average of 60,900 sequences are needed 
(see Table S6A). For instance, for the community of 2UC, a total of 
71,400 of the combined sequences would be needed to achieve 
90% OTU overlap between two technical replicates (Fig. 3A), 
which is about 32 times more sequence reads needed than we 
sequenced previously (see Table S4 to S6). 

Much more sequencing effort is needed to achieve desired 
OTU overlaps among three technical replicates than between two 
technical replicates. To have 10% OTU overlap for three tags, an 
average of 3,310 sequences are needed, which is consistent with 
the number of sequences obtained in this experiment. To reach 
90% overlap between three technical replicates, an average of 
63,770 sequences are needed (see Table S6B). For example, for the 
community of 1UC, about 60,500 sequences are required to ob- 
tain 90% of OTU overlap (Fig. 3B). The current sampling efforts 
(2,018 sequences) are far less than the desired 90% OTU overlap. 
Our results also suggested that most of the work published, espe- 
cially with soils, is severely undersampling if the goal is to deter- 
mine significant changes in j8-diversity among sampling sites. 

DISCUSSION 

One of the major technical challenges for the amplicon-based se- 
quencing detection approach is low reproducibility (50), which is 
a central issue in comparative studies (61). This issue has recently 



been examined intensively, but it is still a matter of debate (13,31, 

51) . Results from several recent studies supported (13, 50, 51, 62), 
disputed (31, 63-65) or both supported and disputed (66) our 
previous observations. A number of factors can contribute to such 
divergent observations, e.g., the complexity of the systems exam- 
ined (50, 51, 64), differences in sequencing depths (50, 62, 63), 
and/or variations in sequencing and sequence preprocessing ap- 
proaches (51, 67). Since most of these factors act with each other, 
isolating individual factors influencing sequencing reproducibil- 
ity is extremely difficult, especially when natural communities of 
unknown diversity background are examined. Using artificial 
communities of known diversity, Pinto and Raskin (51) provided 
explicit evidences for the poor reproducibility of the amplicon 
sequencing-based detection approach, even with the simple com- 
munity and relatively deep sequencing. Thus, poor reproducibil- 
ity is a problem inherent in the amplicon sequencing-based detec- 
tion approach (50). 

Such poor reproducibility among technical replicates could 
result from PCR amplification biases (67-72), sequencing errors 
(51, 67, 71, 72), and/or the artifacts associated with random sam- 
pling processes during sample preparation and sequencing (50, 

52) . In this study, using analogous ball examples, we showed that 
very low percentages of overlap were observed among replicate 
samples from the community with identical ball types and num- 
bers when the sampling effort was low, which is very consistent 
with what we observed experimentally (50). In addition, under 
different OTU abundance distributions, the obtained overlap per- 
centages among two or three technical replicates were very con- 
sistent with the theoretical predicted values under the assumption 
of random sampling processes. The simulation results presented 
in this study provided explicit evidences of the contributions of 
random sampling processes to the high numbers of variations 
observed among technical replicate samples. It should be noted 
that spurious OTUs due to sequencing errors could also contrib- 
ute to such technical variations, but they will be indistinguishable 
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from rare abundant OTUs which can be detected only sporadi- 
cally in replicate samples (51). 

The low reproducibility of the amplicon sequencing-based de- 
tection approach associated with random sampling processes 
raises a concern of comparing the j3-diversity of microbial com- 
munities across different samples for amplicon sequencing. Al- 
though the inherent high numbers of variations associated with 
random sampling processes have less effect on a-diversity, it is 
problematic in estimating j3-diversity (50, 51), which presents a 
significant challenge for comparative studies across different spa- 
tial and temporal scales. However, the degrees of such effects on 
estimating /3-diversity are dependent on the complexity of the 
community examined and sampling efforts. In general, as the 
complexity increases, such problems will be more severe (50) and 
greater sampling efforts will be needed (50, 62). Nevertheless, 
great caution needs to be taken for estimating community based 
on OTUs when the sequences examined are not enough to repre- 
sent community diversity (51, 62). 

Determining the patterns and distribution of species abun- 
dance is a central issue in ecology (53), because they are important 
in studying both basic ecological theory and applied biodiversity 
conservation. However, direct measurement of species abundance 
in ecologically relevant scales is difficult, if not impossible. In- 
stead, the distribution of species abundance in ecology is generally 
inferred based on limited samples under Poisson sampling. 
Therefore, the artifacts associated with random sampling pro- 
cesses observed in this study for microbial communities should 
also be applicable to the other ecological studies in plants and 
animals, although such problems could be less severe in macro- 
ecology. However, to the best of our knowledge, such an issue has 
not been addressed in the ecological literature, but it is of critical 
importance in studying species distribution, especially species- 
area relationships (52, 73). Along with our previous efforts (50, 
52), this study clearly demonstrates the importance of random 
sampling processes in estimating microbial biodiversity, espe- 
cially j3-diversity. The general conclusions learned from this study 
should be applicable to other ecological studies in general. 

In conclusion, the factors causing variations in /3-diversity are 
among the most important but poorly understood issues in ecol- 
ogy, because they are the key mechanisms influencing global vari- 
ation in biodiversity. Next-generation sequencing technologies, 
such as PCR amplicon sequencing-based detection approaches, 
have been rapidly used for characterizing microbial biodiversity, 
but they also suffer from several inherent drawbacks, such as a 
high number of sequencing errors, biases, and poor reproducibil- 
ity and quantitation. Through mathematical modeling and simu- 
lation based on general sampling theory, this study provides ex- 
plicit evidences of random sampling processes as the main factor 
causing high percentages of technical variations and develops a 
framework for predicting sampling efforts for achieving the de- 
sired technical reproducibility. Since most ecological studies are 
involved in random sampling, the artifacts associated with ran- 
dom sampling processes observed in microbial ecology should 
also be applicable to macroecology, although such problems could 
be less severe there. Because such artifacts greatly overestimate 
/3-diversity, great caution should be taken when the amplicon 
sequencing-based detection is used for drawing quantitative con- 
clusions about j3-diversity. Increasing sampling efforts and/or the 
number of sample replicates (both technical and biological) 
should be the most effective ways to ameliorate technical repro- 



ducibility for drawing more reliable quantitative conclusions, but 
how to balance the sampling efforts and number of samples ana- 
lyzed per sequencing run is dependent on biological questions and 
objectives, as well as the complexity and similarity of communities 
examined. 

MATERIALS AND METHODS 

The details for all materials and methods used in this study are provided in 
the supplemental material (see Text SI). Briefly, based on general sam- 
pling theory (54-56), a mathematical framework was first developed un- 
der 7 different OTU abundance distributions for simulating the effects of 
random sampling processes on quantifying j3-diversity when the commu- 
nity size is known or unknown. Second, an analogous ball example was 
used to explicitly illustrate the effects of random sampling processes on 
the OTU overlap among technical replicates. Third, the theoretical mod- 
els were fitted with the empirical experimental sequencing data from our 
previous studies (50), in which a total of 24 soil communities from a 
long-term climate change experiment facility (74) were sequenced with 60 
tags. An average of 1,121 ± 390 OTUs were obtained for each tag based on 
the combined samples. In addition, a y 2 test was employed to determine 
whether the predicted OTU overlaps were consistent with the observed 
values. 

SUPPLEMENTAL MATERIAL 

Supplemental material for this article may be found at http://mbio.asm.org 
/lookup/suppl/doi: 10.11 28/mBio.00324- 1 3/-/DCSupplemental. 
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